Solving the Software 
Safety Paradox 



Embedded systems running safety-critical applications have a quandary. How can the software know that it's 
operating correctly? Can a malfunctioning system diagnose itself and either correct the problem or halt itself? 
This article examines some techniques that allow the software to monitor itself, with a bias toward those tech- 
niques that can be easily implemented in almost any embedded system. 



flflBB n Joseph Heller's novel Catch- 
22, an Army Air Corp doctor 
faces a perplexing dilemma. 
One of the WWII bomber 
squadron's pilots is obviously 
I8B9 insane, but he can't be 
grounded unless he asks to be ground- 
ed; however, if he asks to be grounded, 
then the rules state that this request in 
itself is evidence of sanity. In other 
words, he can't be grounded — Catch- 
22. 

Embedded systems running safety- 
critical applications face a similar para- 
dox. How can the software'know that it 
is operating correctly, especially if it 
isn't? Can the system perform a sanity 
check on itself? This has been a major 
concern for manufacturers and regula- 
tors of safety-critical embedded sys- 



tems since the beginning of the com- 
puter age, and it is a concern that is 
growing daily. 

From heart defibrillators to avionics 
suites, from nuclear power plant con- 
trols to the anti-lock braking system in 
most new cars, microprocessors have 
become an integral part of everyday 
systems upon which thousands, per- 
haps millions, of lives depend. The 
safe operation of these systems cannot 
be taken for granted due to the poten- 
tially catastrophic consequences of a 
malfunction. Given the serious nature 
of a failure in one of these systems, is 
there anything we can do, beyond 
careful design practices and extensive 
testing, to ensure their correct opera- 
tion? Can a malfunctioning system 
diagnose itself and either correct the 



problem or halt itself before behaving 
in an erroneous or potentially danger- 
ous manner? 

The answer to both questions is a 
qualified yes. Many simple techniques 
exist, which allow the software to mon- 
itor itself for correct operation. This 
article examines some of these tech- 
niques with a bias toward those that 
can be easily implemented in almost 
any embedded system. 

CPU self-test 

I have a friend who compares testing 
the CPU to holding a gun to your 
head, pulling the trigger, and then ask- 
ing, "How dead am I?" While I find his 
analogy amusing, I disagree with it. 

I admit that microprocessor failures 
are rare. They are so well-tested prior 
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"iHow can the software know that it is operating correctly, especially if 
it isn't? Can the system perform a sanity check on itself? 



to shipment that a CPU failure is prac- 
tically unheard of. Even when they do 
fail, it will most likely be a catastrophic 
failure of the entire processor; never- 
theless, it is possible for a micro- 
processor to fail in varying degrees, 
just like any other piece of hardware. 
In most cases, it isn't the microproces- 
sor itself, but an external anomaly, 
such as a solder splash, a bent pin, or 
an error in the gate array interface, 
that causes the failure. In one 
instance, the CPU of a robotic welder 
failed, causing an extraneous jump to 
a move routine that resulted in the 
destruction of a small jack.' One can 
only imagine the grim result if a 
human had been standing in place of 
the jack. 

So how can a software program test 
the CPU on which it is executing? 
Granted, a paradox exists in that the 
processor must be running well 
enough to execute the self-test, but by 
starting with very simple instructions, 
verifying their correct operation, and 
then proceeding to more complex 
instructions, testing the CPU is indeed 
possible. 

The need to test individual instruc- 
tions dictates that the CPU test must 
be written in machine assembler code 
instead of a higher-level language. 
Because branching is the main corner- 
stone of any program, it makes sense 
to start with a simple unconditional 
branch or jump instruction. If the 
branch instruction doesn't execute 
correcdy, then the CPU is inherendy 
unreliable. Listing 1 shows a portion 
of a CPU test for a 68332 microcon- 
troller. (The complete CPU test can be 
found on-line at wmw.embedded.com/ 
code.htm.) This test was written as a sub- 
routine that can be called from C 
functions, so some assumptions were 
made that the return instruction oper- 
ates correcdy. 



In concept, and in fact, a CPU self- 
test is quite simple. The branch always 
instruction should cause the micro- 
processor to jump around the f ai I 
code segment to the cpuO label. If the 
branch instruction fails, then the 
processor will fall into the f ai I seg- 
ment and execute the failure code. 
You should keep the failure code fair- 
ly simple because if this segment is 
ever executed, the CPU is assumed to 
be unreliable. This segment could sim- 
ply turn on an LED and then halt the 
processor. 

Once it has been confirmed that an 
instruction works correcdy, the subse- 
quent tests can use the instruction. 
The "Branch on equal/not equal 
zero" test illustrates this concept. If the 
beq instruction fails to execute cor- 
rectly, the processor uses the bra 
instruction to branch to the f ai I label. 
If the beq instruction executes correct- 
ly, the processor branches to the cpulO 
label and begins the next test, which 
can now use the clr, bne, and beq 
instructions. In this manner, each suc- 
cessive test builds upon the successes 
of the previous tests to perform ever- 
more complex operations. 

As a minimum, a CPU self-test 
should test each instruction that may 
be used by the application; verify the 
processor's ability to set, reset, and 
interrogate condition flags; verify the 
correct execution of math operations; 
and, finally, check the addressability 
and the read-write capabilities of each 
register. The register test should check 
for cross-talk between individual bits, 
bytes, and words. Use at least two bit 
patterns that are complementary to 
each other. 

Considering the complexity of 
some of today's microprocessors, test- 
ing every possible instruction in the 
CPU's instruction set may not be prac- 
tical. Fortunately, it's only necessary to 



test those instructions that may be 
used by the application. For example, 
if the CPU includes floating-point 
instructions but your application does 
not perform any floating-point opera- 
tions, then it isn't necessary, although 
it's still a good idea, to test the float- 
ing-point operations. To determine 
which instructions should be tested, 
writing a program that parses the out- 
put of the compiler and creates a list 
of all the generated instructions is a 
fairly simple process. 

CPU tests are typically run at 
power-on, following hardware initial- 
ization but before anything else, and 
during background processing. 
Because the test uses individual 
instructions, the time required to exe- 
cute the test can be determined with a 
high degree of precision by simply 
adding the execution time of the 
assembler instructions. 

In heavily regulated industries, 
such as medical device manufacturing, 
CPU self-tests also provide a substan- 
tial economic benefit. Some federal 
regulations require a test and inspec- 
tion of each component of a device. 
On manufacturing lines that use thou- 
sands of microprocessors each day, the 
cost of testing each microprocessor 
before installation is prohibitive. 
Implementing a CPU self-test avoids 
the cost of implementing a separate 
test line, while still satisfying federal 
regulations. 

Guarding against illegal 
jumps 

Probably the first, and simplest, safety 
technique learned by many embedded 
programmers consists of filling 
unused program memory with either 
halt instructions or illegal instructions. 
This technique guards against illegal » 
jumps outside of program space. 

Consider the case where unused 
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The idea is for the executing software to 
tickle the watchdog at appropriate intervals 
to prevent its expiration. 



ROM is filled with halt instructions. If 
the program jumped out of bounds 
and executed the halt instruction, the 
system will stop. This is extremely 
helpful if it occurs in a debugging 
environment, since you can trace back 
through the stack to determine the 
cause of the illegal jump. In a produc- 
tion environment, a watchdog timer 
could trigger a reset of the unit. 

If illegal instructions are used to fill 
unused program memory, ajump out- 
side of program space would execute 
the illegal instruction and trigger the 
illegal instruction trap. If the error 
occurred while you're in a debug envi- 
ronment, the illegal instruction vector 
can be used to start a debugging func- 
tion. If the error occurs in a produc- 
tion environment, the illegal instruc- 
tion trap could vector to a function 
that attempts to "safe" the system. For 
heart defibrillators, the safe function 
could disable the charger and transfer 
relays. For robotic systems, the safe 
function could shut down all the 
robot's motors. 

Regardless of which instruction is 
used, filling unused ROM provides 
cheap insurance againstjumps outside 
of code space. 

ROM tests 

A ROM test is used to verify the 
integrity of a program or data stored 
in ROM. It is a basic safety precaution 
that should be performed at power-on 
prior to executing the main applica- 
tion. If the ROM test fails, then we 
want to avoid executing the main 
application because we have no idea 
what is programmed in the device. 

In addition to performing the 
ROM test at power-on, running the 
ROM test continuously in the back- 
ground makes sense in some cases. 
This is especially true if your device 
has the ability to alter its own pro- 
gramming (for instance, flash memory 
orEEPROMs). : 

ROM tests use a checksum or a 
cyclic redundancy check (CRC) to 
guard against corrupted memory. The 
technique consists of calculating the 



checksum or CRC over the entire 
ROM space, including unused memo- 
ry, and storing the result in a known 
location. Unused ROM space should 
be filled with halt or illegal instruc- 
tions as I discussed previously. When 
the test is run, boot code calculates a 
checksum or CRC and compares the 
calculated value to the stored value. If 
the values don't compare, then the 
ROM has been corrupted. The basic 
ROM test is simple to comprehend 
and implement, but some careful 
trade-offs need to be made when 
deciding on whether to use a check- 
sum or CRC. 

A checksum is simple to implement 
and is fast; however, it's very easy for 
multiple errors to cancel each other 
out. A CRC, on the other hand, pro- 
vides a much better integrity check, 
but is slower. Some devices require a 
fast start and cannot tolerate the time 
required to complete a CRC check. 
One potential solution is to use both. 

Figure 1 illustrates a program 
memory layout. The last eight bytes 
are occupied by a stored CRC value 
and a checksum value respectively. 
The checksum is computed over the 
entire contents of ROM, including the 
CRC value. At power-up, the check- 
sum is calculated and compared to the 
stored value. This provides a quick 
integrity check that confirms the sys- 
tem can continue the boot process. 
Once the boot process is complete, a 
background task can run the CRC 
check continually. This provides a 
higher integrity check and also makes 
sure that the program ROM doesn't 
become corrupted during program 
execution. 

Watchdog timers 

In its simplest form, a watchdog timer 5 
is a circuit or function that counts 
down from a preset time until it 
expires or is reset (tickled). The idea 
is for the executing software to tickle 



the watchdog at appropriate intervals 
to prevent its expiration. If the soft- 
ware gets caught in a loop, or is pre- 
vented from tickling the watchdog, 
the timer expires and causes a proces- 
sor reset. In this form, a watchdog 
timer isn't stricdy a software tech- 
nique. It requires a hardware circuit to 
reset the processor if it isn't tickled 
prior to expiration; however, many 
real-time operating systems (RTOSes) 
provide a software watchdog capability 
that can be applied only to software 
tasks. On the hardware side, watchdog 
timers have become so common that 
many chips have a watchdog timer 
built in. 

Watchdog timers are most effective 
in cyclic systems where the watchdog is 
tickled each time through the loop. 
Watchdogs are somewhat more diffi- 
cult to implement in multitasking sys- 
tems due to their non-determinism.' 
I've seen some multitasking systems 
that use an interrupt to tickle the 
watchdog. This approach defeats the 
whole purpose for having one in the 
first place. If all the tasks were blocked 
and unable to run, the interrupt 
method would continue to service the 
watchdog and the reset would never 
occur. A better solution is to use a sep- 
arate monitor task that not only tickles 
the watchdog, but monitors the other 
system tasks as well. For example, a 
task that must run at least once per 
second could increment a counter. 
The monitor task reads the counter 
and "safes" the system if the task 
counter stops incrementing at the 
expected rate. If the monitor task was 
blocked, it would not tickle the watch- 
dog and the system would reset. 

A potential window of vulnerability 
exists with watchdog systems as they're 
implemented in most systems. Many 
systems only check for failure to strobe 
the watchdog too slow; if the system is 
running at a higher rate than intend- 
ed, the traditional watchdog won't 
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Portion of CPU test for a 68332 microcontroller 
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catch it. A better technique is to use a 
windowed watchdog. 

The windowed watchdog resets the 
processor if it is tickled either too fast 
or too slow. For example, if your task 
must nominally execute every 50 +/- 
10% milliseconds, a windowed watch- 
dog circuit would reset the processor 
if it was tickled before 45ms, or if 55ms 
passed without being tickled. 
Although windowed watchdogs are 
more difficult to implement, they pro- 
vide a higher degree of protection 
than standard watchdog timers. 

Redundant storage and 
cross checks of critical 
variables 

All of the techniques I have discussed 
to this point guard against gross errors 
in the system, but what about more 
subde errors, such as corruption of 
variables? 

We've all heard stories of bit flips 
that were caused by cosmic rays or 
EMI. Some engineers with whom I've 
spoken argue that cosmic ray bit flips 
are rare in today's modern memories. 
Other engineers argue that with bit 
densities going up and low-voltage 
devices becoming more prevalent, bit 
flips are more likely to occur than 
ever. Another, and perhaps more com- 
mon, cause of memory corruption is a 
rogue pointer, which can run wild 
through memory leaving a trail of cor- 
rupted variables in its wake. 
Regardless of the cause, the designer 
of safety-critical software must consid- 
er the threat that sometime, some- 
where, a variable will be corrupted. If 
one of the corrupted variables hap- 
pened to store a critical parameter, 
such as the energy setting for a heart 
defibrillator or the fluid level in a 
chemical mixing vat, the results could 
be quite unsatisfactory. 

A technique commonly used in 
safety-critical software to guard against 
variable corruption is redundant stor- 
age of critical variables. Three copies 
of each critical variable are stored in 
memory. Ideally, each copy should be 
stored in a separate memory type 
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Another, and perhaps more common, cause of memory corruption is a 
rogue pointer, which can run wild through memory, leaving a trail of 

corrupted variables in its wake. 



using different data formats. For 
example, most microprocessors have a 
small amount of onboard (internal) 
RAM in addition to the RAM designed 
onto the printed circuit board by the 
hardware designers. This provides two 
separate memories for storage of criti- 
cal variables. One copy can be stored 
in internal RAM and the two other 
copies can be stored in the external 
RAM. Of the two copies in external 
RAM, one is stored using an inverted 
bit pattern. 

Prior to using the variable, the 
three copies are compared. If any of 
the three variables do not agree, a two- 
of-three voting determines which 
value should be used. If none of the 
three values match, then a default 
value is used. 



An extra level of protection is pro- 
vided by grouping critical variables 
together and keeping a CRC over each 
group. In case of a mismatch, the CRC 
can be used to determine which copy 
is corrupted. The good copies can 
then be used to repair the corrupted 
copy. 

Two drawbacks are evident when 
keeping redundant copies of variables. 
One is the additional time required to 
access the variable copies and perform 
the integrity checks. The second draw- 
back is the shear number of variables 
used in embedded systems. Keeping 
redundant copies of all variables is 
impractical, so the designer must 
determine beforehand which variables 
are critical to the safe operation of the 
application. 
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Stack checks 

A stack check guards against stack 
overflow or the corruption of a task's 
stack. Stacks that overflow quickly are 
normally caught early in the develop- 
ment process; however, memory leaks 
can cause stacks to fill slowly until they 
eventually overflow. I once had a situa- 
tion in which a stack would overflow 
only after several days of continuous 
operation. A stack-monitoring task 
alerted me to the problem and 
allowed me to correct it prior to the 
product shipment 

Many commercial RTOSes contain 
facilities and functions that support 
stack checking, but it can be imple- 
mented without an RTOS as well. The 
basic technique consists of initializing 
the stack with a known hex pattern. 
One commercial RTOS pours COFFEE 
into the stack at initialization. 

er to 



This raises a troubling question: 

when doing a mathematical calculation, how do we know 
the result is correct? 



use a pattern that corresponds with 
an illegal instruction on the micro- 
processor. This is for the same rea- 
sons I mentioned earlier in guarding 
against illegal jumps. 

Once the stack has been initialized 
with a known pattern, a simple moni- 
tor function can be written to examine 
the stack, determine where the pat- 
tern has been overwritten, and calcu- 
late the remaining stack space. If the 
stack margin shrinks below some pre- 
determined amount, the monitor 
function will call an error processing 
routine. 

Stack monitors require little over- 
head to implement and are an effec- 
tive means of guarding against stack 
overflows. 

Program checks 

One of the primary functions of a 
microprocessor is computing the 
result of mathematical equations. The 
equations could be the calculations to 
determine the correct energy for a 
defibrillator shock, aircraft vectors in 
an air traffic control system, or control 
rod positions in a nuclear power plant. 
Whatever the purpose of the system, 
the correct computation of these 
equations is often central to its safe 
operation. This raises a troubling 
question: when doing a mathematical 
calculation, how do we know the result 
is correct? 

Most software development organi- 
zations try to address this problem by 
extensive testing. Testing is important 
and should not be omitted in favor of 
other methods, but it does have the 
shortcoming of being incomplete. All 
possible combinations of numbers 
cannot be covered. Another situation 
occurs if your production line starts 
using a different, yet supposedly com- 
patible, microprocessor. A program 
that used to operate correctly on the 
previous microprocessor fails when 
the upgraded part is installed. If you 
think this isn't a problem, remember 
the Pentium Floating-Point Bug? 
Programs that operated correcdy on a 
486 processor could fail when execut- 



ed on a Pentium processor. 

An alternative to testing is using 
mathematical proofs to prove that an 
algorithm is correct. The problem 



with proving correctness is the com- 
plexity involved in proving even the 
simplest algorithm. Additionally, a 
proof only proves that the algorithm is 
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To this day, I can still hear Mrs. Stewart 
admonishing my second grade class to "Check your work." 



correct. It cannot catch processor 
errors. 

A third, and probably the best alter- 
native, is to use program checks 4 to 
verify calculation results. Program 
checks address the basic problem of 
verifying that a calculation is correct 
before the result is utilized. The tech- 



nique is one that you were taught in 
elementary school. 

To this day, I can still hear Mrs. 
Stewart admonishing my second grade 
class to "Check your work." Check sub- 
traction by adding the result to one of 
the numbers in the original problem 
to get the other number, (for exam- 
ple, If 10 - 3 = 7 then 7 + 3 had better 
equal 10.) Check division by plugging 



your answer back into a multiply, and 
so on. In their simplest form, program 
checks are the same thing you were 
taught in grade school. 

When checking calculations, be 
sure to allow tolerances for checking 
your answers. Digital arithmetic isn't 
exact! The designer must decide on 



an error tolerance and only perform 
the error processing if the result falls 
outside of this tolerance. 

Program checks have two draw- 
backs. First, program checks roughly 
double your computation time 
because each calculation is essentially 
performed twice: once for the original 
calculation, and again, in reverse, to 
check the result In a system where 



processor throughput is an issue, 
checking all of the calculations may 
not be practical. Second, if the check 
signals a mismatch, we're not sure 
where the error has occurred, whether 
in the original calculation or in the 
check calculation. We only know that 
there is a problem and further analysis 
is necessary to determine the cause. 
Despite these drawbacks, program 
checks offer a powerful means for pro- 
grams to check themselves for correct 
operation. 

Simple, yet effective 

These safety self-tests I've described 
are only a few of the techniques that 
you can employ to have software check 
itself for proper execution — many oth- 
ers exist as well. The techniques here 
are some of the simplest, yet most 
effective. They can be easily imple- 
mented in most embedded systems. 
Although these safety self-tests can 
detect a bad microprocessor, bad 
ROM, or wild software, they certainly 
are not a guarantee of correct opera- 
tion. They can't diagnose design flaws, 
nor can they compensate for require- 
ments errors. Nevertheless, they are 
effective tools in designing safe sys- 
tems and are a welcome addition to a 
programmer's toolbox. esp 
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